{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use of Optional Data Curation Features by Harvard Dataverse Users"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Cleaning and Analysis Workflow for Harvard Dataverse Inventory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Author:** C. M. Boyd, ceilyn.boyd@simmons.edu, ceilyn_boyd@harvard.edu\n",
    "- **Created:** 8 June 2020\n",
    "- **Updated:** 6 July 2020"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**About the Data:** Two files serve as raw data for this project: `raw_filedata.csv` and `raw_dataset_data.csv`. These two files reflect the inventory of the Harvard Dataverse repository on 28 October 2019 and 29 October 2019. \n",
    "\n",
    "- `raw_filedata.csv`: Using an SQL query, a file containing a list of all files associated with datasets was created on 28 October 2019. This inventory identified **383,687** files associated with **29,295** datasets. \n",
    "- `raw_dataset_data.csv`: A second SQL query was made on 29 October 2019 to get at list of all datasets at that time. There were **29,997** datasets in the inventory on that day.\n",
    "\n",
    "**Note:** There is 702 item discrepancy between the number of datasets reflected in these two files. Data cleaning and analysis was performed using Python code and Excel in order to generate descriptive statistics about the Harvard Repository on those two days. Ultimately, **29,295** datasets and **383,685** files were identified."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Systems\n",
    "- **Python Environment:** Anaconda Navigator 1.6.9, Python 3.6\n",
    "- **Jupyter Notebook:** Version 5.0.0\n",
    "- **MacOS:** Catalina, 10.15.4 and Mojave, 10.14.6\n",
    "- **Excel for MacOS:** 16.37\n",
    "- **DB Browser for SQLite:** 3.11.2 (https://sqlitebrowser.org) (Used to review SQL and inspect output of SQL statements)\n",
    "- **SPSS for MacOS:** Version 26"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Input Files\n",
    "These two raw input files were created by an IQSS database analyst who generated the SQL queries on the database to capture raw information about datasets and files in Harvard Dataverse. These two files were then cleaned, processed, and analyzed by C. M. Boyd to generate descriptive statistics about the data.\n",
    "\n",
    "- `raw_filedata.csv`: File containing a list of all files associated with datasets was created on 28 October 2019 using an SQL query. This inventory identified **383,687** files associated with **29,296** datasets. \n",
    "- `raw_dataset_data.csv`: File containing a list of all datasets in Harvard Dataverse on 29 October 2019 created using an SQL query. There were **29,997** datasets in the inventory on that day.\n",
    "\n",
    "### Generated Files\n",
    "- `dataset_data_subset.tsv`: The file containing the subset of datasets that were referenced in cleaned_filedata.tsv and all additional fields needed to support statistical analysis using SPSS\n",
    "\n",
    "### Other Output Files\n",
    "- `dataset_data_subset.xlxs`: The manually create, resaved file `dataset_data_subset.tsv`; operation is performed because there is a complex record in the file that contains nested quotes that confuses the SPSS tab-separated-values and comma-separated-values parsers.\n",
    "- `hdofu_dataset_data_subset.sav`: Manually created SPSS project file, result of importing `dataset_data_subset.xlsx` and performing statistical processes upon it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Excel\n",
    "### Clean Raw Data Files\n",
    "#### 1. Clean `raw_dataset_data.csv`\n",
    "Upon inspection, I discovered one poorly formatted entry in `raw_dataset_data.csv` for `dataset_id = 2676050`. Therefore, using Excel, I manually removed this record from `raw_dataset_data.csv` and saved the resulting file as `cleaned_dataset_data.tsv`. Additional cleaning to address NA values is performed before this file is ingested into SQLite3.\n",
    "\n",
    "#### 2. Clean `raw_filedata.csv`\n",
    "To reflect the deletion of the record in `raw_dataset_data.csv` where `dataset_id = 2676050`, I also use Exel to remove all file records associated with `dataset_id = 2676050` from `raw_filedata.csv` to create the new tab delimited file `cleaned_filedata.tsv`. Additional cleaning to address NA values is performed before this file is ingested into SQLite3."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Global Variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### File-related Variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import os module for interacting with the operating system\n",
    "import os\n",
    "\n",
    "# Establish base directory for data, code, and jupyter notebook associated with the project\n",
    "curr_dir = os.getcwd()\n",
    "dir_array = curr_dir.split('/')\n",
    "dir_array.pop(len(dir_array)-1)\n",
    "base_dir = '/'.join(dir_array)\n",
    "\n",
    "# Directory for the data quality project\n",
    "project_dir = base_dir\n",
    "\n",
    "# Directory for the SQLite3 database\n",
    "db_dir = project_dir + '/db'\n",
    "\n",
    "# Database name\n",
    "db_name = 'hdofu.db'\n",
    "\n",
    "# Database file path\n",
    "db_path = db_dir + '/' + db_name\n",
    "\n",
    "# Directory for the data quality project data (input and output)\n",
    "data_dir = project_dir + '/data'\n",
    "\n",
    "# Cleaned dataset data file\n",
    "cleaned_dataset_data_tsv = data_dir + '/cleaned/cleaned_dataset_data.tsv'\n",
    "\n",
    "# Cleaned filedata file\n",
    "cleaned_filedata_tsv = data_dir + '/cleaned/cleaned_filedata.tsv'\n",
    "\n",
    "# Target output file for subset of datasets used in SPSS analyses\n",
    "dataset_data_subsets_tsv = data_dir + '/dataset_data_subset.tsv'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import Python Libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import Pandas library, v. 1.0.3\n",
    "import pandas as pd\n",
    "\n",
    "# Import sQLite3: DB-API 2.0 interface for SQLite databases\n",
    "import sqlite3\n",
    "\n",
    "# Import csv\n",
    "import csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Process Data Files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. Create the `hdofu` database and read in the file `cleaned_dataset_data.tsv` to create the `cleaned_dataset_data` table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Create or open the dataquality database\n",
    "db = sqlite3.connect(db_path)\n",
    "\n",
    "# Use Pandas to read in cleaned_dataset_data.tsv\n",
    "df = pd.read_csv(cleaned_dataset_data_tsv, delimiter='\\t')\n",
    "\n",
    "# Create and/or replace data to cleaned_dataset_data table\n",
    "df.to_sql('cleaned_dataset_data', db, if_exists=\"replace\")\n",
    "\n",
    "# Close the connection\n",
    "db.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Create database and read in `cleaned_filedata.tsv` to create the `cleaned_filedata` table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Open the dataquality database\n",
    "db = sqlite3.connect(db_path)\n",
    "\n",
    "# Use Pandas to read in cleaned_filedata.tsv\n",
    "df = pd.read_csv(cleaned_filedata_tsv, delimiter='\\t')\n",
    "\n",
    "# Fill download_count field with zeros, downcast float to int\n",
    "df.fillna(0,inplace=True,downcast={'int64'})\n",
    "\n",
    "# Create and append data to cleaned_filedata table\n",
    "df.to_sql('cleaned_filedata', db, dtype={'download_count':'INTEGER'}, if_exists=\"replace\")\n",
    "\n",
    "# Close the connection\n",
    "db.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 2.1 Add `cleaned_filedata.is_PSF` column, with values, to `cleaned_filedata` table for possible supplemental files and update `cleaned_filedata.has_PSF` with values from filedata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Open the dataquality database\n",
    "db = sqlite3.connect(db_path)\n",
    "\n",
    "# Get the cursor\n",
    "c = db.cursor()\n",
    "\n",
    "# Alter table sql statement, to add is_PSF field to dataset\n",
    "sql = 'ALTER TABLE cleaned_filedata ADD is_PSF INTEGER;'\n",
    "\n",
    "# Set initial values in the new is_PSF column to 0\n",
    "sql = sql + 'UPDATE cleaned_filedata SET is_PSF = 0;'\n",
    "\n",
    "# Update with possible supplemental files\n",
    "sql = sql + 'UPDATE cleaned_filedata SET is_PSF = 1 WHERE cleaned_filedata.filename LIKE \"%codebook%\" OR cleaned_filedata.filename LIKE \"%code book%\" OR cleaned_filedata.filename LIKE \"%readme%\";'\n",
    "\n",
    "\n",
    "# Execute the sql\n",
    "c.executescript(sql)\n",
    "\n",
    "# Close the connection\n",
    "db.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Summarize `cleaned_filedata` as view. Group `cleaned_filedata` table by `dataset_id`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Open the dataquality database\n",
    "db = sqlite3.connect(db_path)\n",
    "\n",
    "# Get the cursor\n",
    "c = db.cursor()\n",
    "\n",
    "# Create summarized view\n",
    "sql = 'CREATE VIEW filedata_summary_view AS SELECT dataset_id, SUM(download_count) AS download_count, SUM(is_PSF) AS num_PSF, SUM(restricted) AS num_RF FROM cleaned_filedata GROUP BY dataset_id;'\n",
    "\n",
    "# Execute the sql\n",
    "c.execute(sql)\n",
    "\n",
    "# Close the connection\n",
    "db.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. Create `dataset_data_subset` by combining `cleaned_dataset_data` table with `filedata_summary_view` view.\n",
    "The `cleaned_dataset_data` table has more unique `dataset_ids` than `filedata_summary`. Join the two tables to return a combined table that only contains matching records that appear in the `filedata_summary_view` view. Note that we create `dataset_data_subset` as a table, rather than a view, because we will make additions to the table (e.g. add columns) to support statistical analyses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Open the dataquality database\n",
    "db = sqlite3.connect(db_path)\n",
    "\n",
    "# Get the cursor\n",
    "c = db.cursor()\n",
    "\n",
    "sql = 'CREATE TABLE dataset_data_subset AS SELECT filedata_summary_view.download_count, filedata_summary_view.num_PSF, filedata_summary_view.num_RF, cleaned_dataset_data.dataset_id, cleaned_dataset_data.dataset_doi_url, cleaned_dataset_data.cc0,cleaned_dataset_data.terms_of_use, cleaned_dataset_data.pub_year, cleaned_dataset_data.latest_version_year, cleaned_dataset_data.dataset_version, cleaned_dataset_data.description_length, cleaned_dataset_data.subject, cleaned_dataset_data.keyword_count, cleaned_dataset_data.datasets_with_relpubs, cleaned_dataset_data.dataverse_id, cleaned_dataset_data.dataverse_url, cleaned_dataset_data.dataverse_cat, cleaned_dataset_data.optional_metadata, cleaned_dataset_data.submit_for_review FROM filedata_summary_view LEFT JOIN cleaned_dataset_data ON (filedata_summary_view.dataset_id = cleaned_dataset_data.dataset_id);'\n",
    "\n",
    "# Execute the sql\n",
    "c.execute(sql)\n",
    "\n",
    "# Get the subset from the database and save to an external file\n",
    "df = pd.read_sql('SELECT * from dataset_data_subset', db)\n",
    "df.to_csv(dataset_data_subsets_tsv, sep='\\t')\n",
    "\n",
    "# Close the connection\n",
    "db.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5. Add new columns to `dataset_data_subset` to hold information relevant to performing analyses related to optional feature use. \n",
    "These variables may be assigned the value of either 1 (TRUE) or 0 (FALSE). The following variables are added, and set to 0 (FALSE) initially:\n",
    "1. **`has_PSF`:** dataset has possible supplemental files\n",
    "1. **`has_OMB`:** dataset has one or more optional metadata blocks\n",
    "1. **`has_ATOU`:** dataset has additional terms of use\n",
    "1. **`has_KW`:** dataset has one or more keywords\n",
    "1. **`has_DESC`:** dataset has description of one or more words\n",
    "1. **`has_RPUBS`:** dataset has related publications\n",
    "1. **`has_RF`:** dataset has one or more restricted files\n",
    "1. **`has_REVIEW`:** dataset requires a review before submission\n",
    "1. **`ofus`:** optional feature use score, the sum of the following variables: `has_OMB`, `has_ATOU`, `has_KW`, `has_DESC`, and `has_PSF`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Open the dataquality database\n",
    "db = sqlite3.connect(db_path)\n",
    "\n",
    "# Get the cursor\n",
    "c = db.cursor()\n",
    "\n",
    "# Alter table sql statement, to add has_PSF field to dataset\n",
    "# Variable: has possible supplemental files\n",
    "sql = 'ALTER TABLE dataset_data_subset ADD has_PSF INTEGER;'\n",
    "\n",
    "# Set initial values in the new has_PSF column to 0\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_PSF = 0;'\n",
    "\n",
    "# Alter table sql statement, to add has_OMB field to dataset\n",
    "# Variable: has one or more optional metadata blocks\n",
    "sql = sql + 'ALTER TABLE dataset_data_subset ADD has_OMB INTEGER;'\n",
    "\n",
    "# Set initial values in the new has_OMB column to 0\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_OMB = 0;'\n",
    "\n",
    "# Alter table sql statement, to add has_ATOU field to dataset \n",
    "# Variable: has additional terms of use\n",
    "sql = sql + 'ALTER TABLE dataset_data_subset ADD has_ATOU INTEGER;'\n",
    "\n",
    "# Set initial values in the new has_ATOU column to 0\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_ATOU = 0;'\n",
    "\n",
    "# Alter table sql statement, to add has_KW field to dataset \n",
    "# Variable: has keywords\n",
    "sql = sql + 'ALTER TABLE dataset_data_subset ADD has_KW INTEGER;'\n",
    "\n",
    "# Set initial values in the new has_KW column to 0\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_KW = 0;'\n",
    "\n",
    "# Alter table sql statement, to add has_DESC field to dataset \n",
    "# Variable: has description\n",
    "sql = sql + 'ALTER TABLE dataset_data_subset ADD has_DESC INTEGER;'\n",
    "\n",
    "# Set initial values in the new has_DESC column to 0\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_DESC = 0;'\n",
    "\n",
    "# Alter table sql statement, to add has_RPUBS field to dataset \n",
    "# Variable: has related publications\n",
    "sql = sql + 'ALTER TABLE dataset_data_subset ADD has_RPUBS INTEGER;'\n",
    "\n",
    "# Set initial values in the new has_RPUBS column to 0\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_RPUBS = 0;'\n",
    "\n",
    "# Alter table sql statement, to add has_RF field to dataset \n",
    "# Variable: has restricted files\n",
    "sql = sql + 'ALTER TABLE dataset_data_subset ADD has_RF INTEGER;'\n",
    "\n",
    "# Set initial values in the new has_RF column to 0\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_RF = 0;'\n",
    "\n",
    "# Alter table sql statement, to add has_REVIEW field to dataset \n",
    "# Variable: requires a review prior to submission\n",
    "sql = sql + 'ALTER TABLE dataset_data_subset ADD has_REVIEW INTEGER;'\n",
    "\n",
    "# Set initial values in the new has_REVIEW column to 0\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_REVIEW = 0;'\n",
    "\n",
    "# Alter table sql statement, to add ofus field to dataset \n",
    "# Variable: sum of has_OMB, has_ATOU, has_KW, has_DESC, and has_PSF.\n",
    "sql = sql + 'ALTER TABLE dataset_data_subset ADD ofus INTEGER;'\n",
    "\n",
    "# Set initial values in the new ofus column to 0\n",
    "sql = sql + 'UPDATE dataset_data_subset SET ofus = 0;'\n",
    "\n",
    "# Execute the sql\n",
    "c.executescript(sql)\n",
    "\n",
    "# Close the connection\n",
    "db.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6. Populate the new columns with values drawn from other columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Open the dataquality database\n",
    "db = sqlite3.connect(db_path)\n",
    "\n",
    "# Get the cursor\n",
    "c = db.cursor()\n",
    "\n",
    "# Variable: has possible supplemental files\n",
    "# Update for datasets that have one or more possible supplemental files\n",
    "sql = 'UPDATE dataset_data_subset SET has_PSF = 1 WHERE dataset_data_subset.num_PSF >=1;'\n",
    "\n",
    "# Variable: has one or more optional metadata blocks\n",
    "# Update for datasets that have one or more optional metadata blocks\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_OMB = 1 WHERE dataset_data_subset.optional_metadata >=1;'\n",
    "\n",
    "# Variable: has additional terms of use\n",
    "# Update where dataset has an additional terms of use agreement (field is not null)\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_ATOU = 1 WHERE dataset_data_subset.terms_of_use IS NOT NULL;'\n",
    "\n",
    "# Variable: has keywords\n",
    "# Update where dataset has one or more keyword\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_KW = 1 WHERE dataset_data_subset.keyword_count >= 1;'\n",
    "\n",
    "# Variable: has description\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_DESC = 1 WHERE dataset_data_subset.description_length >= 1;'\n",
    "\n",
    "# Variable: has related publications\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_RPUBS = 1 WHERE dataset_data_subset.datasets_with_relpubs >= 1;'\n",
    "\n",
    "# Variable: has restricted files\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_RF = 1 WHERE dataset_data_subset.num_RF >= 1;'\n",
    "\n",
    "# Variable: requires a review prior to submission\n",
    "sql = sql + 'UPDATE dataset_data_subset SET has_REVIEW = 1 WHERE dataset_data_subset.submit_for_review = 1;'\n",
    "\n",
    "# Variable: sum of has_OMB, has_ATOU, has_KW, has_DESC, and has_PSF.\n",
    "sql = sql + 'UPDATE dataset_data_subset SET ofus = has_OMB + has_ATOU + has_KW + has_DESC + has_PSF;'\n",
    "\n",
    "# Execute the sql\n",
    "c.executescript(sql)\n",
    "\n",
    "# Close the connection\n",
    "db.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7. Save `dataset_data_subset` table as an external file: `dataset_data_subset.txt` for later analysis using SPSS. \n",
    "- **Note:** Use the extention `*.txt` instead of `*.tsv` because SPSS does not understand `*.tsv`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Open the dataquality database\n",
    "db = sqlite3.connect(db_path)\n",
    "\n",
    "# Get the subset from the database and save to an external file\n",
    "df = pd.read_sql('SELECT * from dataset_data_subset', db)\n",
    "df.to_csv(dataset_data_subsets_tsv, sep='\\t',index=False)\n",
    "\n",
    "# Close the connection\n",
    "db.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 8. Use Excel to save `dataset_data_subset.tsv` as `dataset_data_subset.xlxs`\n",
    "- **Note:** We need to resave the file as Excel format because there is a complex record in the file that contains nested quotes that confuses the SPSS tab-separated-values and comma-separated-values parsers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SPSS Data Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Ingest `dataset_data_subset.tsv` into SPSS\n",
    "- Load `dataset_data_subset.txt` into SPSS and manually set all variables to the correct type. For instance, `has_PSF` is set to type `numeric`.\n",
    "- Save SPSS dataset to `{HOME}/hdofu/spss/hdofu_dataset_data_subset.sav`\n",
    "\n",
    "#### Perform statistical analyses\n",
    "- Please see final paper for details"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "End document."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
